Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: server: make the span stats fan-out more fault tolerant #109008

Merged
merged 1 commit into from
Aug 21, 2023

Conversation

zachlite
Copy link
Contributor

@zachlite zachlite commented Aug 18, 2023

Backport 1/1 commits from #108456


This commit adds improved fault tolerance to the span stats fan-out:

  1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response in the newly added Errors field.

  1. Nodes must service requests within the timeout passed to iterateNodes. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves #106097
Epic: none
Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.


Release justification: low risk fault tolerance improvements

@zachlite zachlite requested a review from a team as a code owner August 18, 2023 15:50
@zachlite zachlite requested a review from a team August 18, 2023 15:50
@zachlite zachlite requested a review from a team as a code owner August 18, 2023 15:50
@zachlite zachlite requested a review from a team August 18, 2023 15:50
@zachlite zachlite requested a review from a team as a code owner August 18, 2023 15:50
@zachlite zachlite requested review from dhartunian and removed request for a team August 18, 2023 15:50
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is a backport it needs to have that on the title, so # release-23.1: server: make... and then you need to add Release justification: to the PR description

Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

@zachlite zachlite changed the title server: make the span stats fan-out more fault tolerant release-23.1: server: make the span stats fan-out more fault tolerant Aug 18, 2023
Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @zachlite)


pkg/server/status.go line 2851 at r1 (raw file):

}

const noTimeout time.Duration = 0

on master this const is created on the file pagination.go, this file doesn't exist on this branch?

Copy link
Contributor Author

@zachlite zachlite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @maryliag)


pkg/server/status.go line 2851 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

on master this const is created on the file pagination.go, this file doesn't exist on this branch?

I'm working on a separate backport for #107796.
We can wait and rebase this PR on top of that upcoming backport.

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @zachlite)


pkg/server/status.go line 2851 at r1 (raw file):

Previously, zachlite wrote…

I'm working on a separate backport for #107796.
We can wait and rebase this PR on top of that upcoming backport.

Sounds better! Let's get that one if first! thank you

@zachlite
Copy link
Contributor Author

will rebase on top of #109015

@zachlite zachlite force-pushed the backport23.1-108456 branch from f137ca2 to d266fc3 Compare August 18, 2023 18:48
@zachlite
Copy link
Contributor Author

@maryliag - rebase is done.

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!
:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian)

@zachlite
Copy link
Contributor Author

Will rebase on top of the latest commits on release-23.1, since its been a few days. Just in case 👍

This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants